(54) Acoustic-assisted image processing

(57) Acoustic-assisted image processing is achieved, in
accordance with the invention, by a novel method and
apparatus in which an audio signal is sampled at an
audio-domain sampling rate; a first viseme sequence is
generated at a first rate in response to the sampled
audio signal, the first rate corresponding to the
audio-domain sampling rate; the first viseme sequence is
transformed into a second viseme sequence at a second
rate using a predetermined set of transformation
criteria, the second rate corresponding to a video-domain
frame rate; and an image is processed in response to the
second viseme sequence. In an illustrative example of the
invention, a video image of a face of a human speaker is
animated using a three-dimensional wire-frame facial
model upon which a surface texture is mapped. The
three-dimensional wire-frame facial model is structurally
deformed in response to a rate-transformed viseme
sequence extracted from a speech signal so that the mouth
region of the video image moves in correspondence with
the speech. Advantageously, the animation is accomplished
in real time, works with any speaker, has no limitations
on vocabulary, and requires no special action on the part
of the speaker.
Description

Technical Field

This invention relates to speech recognition and image
animation. More particularly, this invention relates
to acoustic-assisted image processing.

Background of the Invention 

Lip reading is an ancient art which has proven valuable
as an aid for hearing-impaired people in understanding
verbal communication. In modern times, the idea of using
visual information to assist in understanding verbal
communication has been successfully extended as a way of
increasing the accuracy of speech recognition by
machines. However, there have been no entirely
satisfactory applications of the reverse situation, that
is, using verbal communication to assist the processing
of visual information by machines. More specifically, it
would be desirable to use acoustic information, such as
speech, to assist in the animation of video images. It
would particularly be desirable to be able to do such
animation in real time, with any speaker, without
limitations on vocabulary, and without requiring any
special actions by the speaker.

Summary of the Invention 

Acoustic-assisted image processing is achieved, in
accordance with the invention, by a novel method and
apparatus in which an audio signal is sampled at an
audio-domain sampling rate; a first viseme sequence is
generated at a first rate in response to the sampled
audio signal, the first rate corresponding to the
audio-domain sampling rate; the first viseme sequence is
transformed into a second viseme sequence at a second
rate using a predetermined set of transformation
criteria, the second rate corresponding to a video-domain
frame rate; and an image is processed in response to the
second viseme sequence.

In an illustrative example of the invention, a video
image of a face of a human speaker is animated using
a three-dimensional wire-frame facial model upon which
a surface texture is mapped. The three-dimensional
wire-frame facial model is structurally deformed in
response to a rate-transformed viseme sequence extracted
from a speech signal so that the mouth region of the
video image moves in correspondence with the speech.
Advantageously, the animation is accomplished in real
time, works with any speaker, has no limitations on
vocabulary, and requires no special action on the part
of the speaker.

Brief Description of the Drawing

FIG. 1 is a simplified block diagram of an illustrative
example of an acoustic-assisted image processor, in
accordance with the invention.

FIG. 2 shows details of the viseme sequence gen- 
erator shown in FIG. 1. 

FIG. 3 is a simplified block diagram showing details 
of the operation of the viseme acoustic feature extractor 
shown in FIG. 2.

FIG. 4 is a simplified block diagram showing details 
of the operation of the viseme recognizer shown in FIG. 
2. 

FIG. 5 shows six feature points used in animating a 
facial image. 

FIG. 6 is a simplified flow chart which shows the
operation of the sequence transformer shown in FIG. 1.

FIG. 7 is a simplified block diagram which illustrates 
a weighted moving average process, in accordance with 
the invention. 

FIG. 8 shows an illustrative example of a 3-D wire
frame facial model.

FIGs. 9 and 10 show two exemplary 3-D wire frame
models illustrating some principles of the invention. 

FIGs. 11 and 12 show the 3-D wire frame images 
shown in FIGs. 9 and 10 in which a surface texture has 
been applied. 

FIG. 13 shows an illustrative example of a
telecommunications system, incorporating an aspect of the
invention.

FIG. 14 shows another illustrative example of a tel- 
ecommunications system, incorporating an aspect of 
the invention. 

Detailed Description 

The present invention discloses a method and apparatus
for synthesizing an animated video image using
parameters extracted from an audio signal. In a first
illustrative example of the invention, the animated
facial image of a speaker is synthesized in response to
a speech signal. Such an illustrative example of the
invention provides a number of advantages, for example,
by allowing rapid and accurate machine-generated
animation of cartoons or video games. Alignment of an
actor's voice with a cartoon character's mouth has been
considered to be one of the most challenging and
time-consuming processes associated with cartoon and
video game production, since such animation is
traditionally done by hand. These and other advantages of
the first illustrative example of the invention will
become apparent in light of the description that follows.

FIG. 1 is a simplified block diagram of an
acoustic-assisted image processor 100, in accordance with
the invention. Because it is acoustic-assisted, image
processor 100 operates in both the audio and video
domains. Image processor 100 comprises viseme sequence
generator 120, viseme sequence transformer 130,
structural deformation generator 150, and texture mapper
160, coupled, as shown, in a serial arrangement. Details
as to the operations of each of these components are
discussed in turn below. For clarity of exposition, the
illustrative examples of the present invention are
presented as comprising individual functional and
operational blocks. The functions and operations that
these blocks represent may be provided through the use of
either shared or dedicated hardware, including, but not
limited to, hardware capable of executing software. For
example, the functions of the acoustic-assisted image
processor 100 in FIG. 1 may be provided by a single
shared processor. It should be noted that the term
"processor" should not be construed to refer exclusively
to hardware capable of executing software.

As shown in FIG. 1, an audio signal is input to image
processor 100 on line 110. The audio signal in this
illustrative example is a continuous waveform
representing speech. Viseme sequence generator 120
generates a sequence of visemes in response to the audio
signal. A viseme is a sequence of oral-facial movements,
or mouth shapes, which corresponds to certain articulate
linguistic-based units, such as audio phonemes. Visemes
are known, being described, for example, by K.W. Berger,
Speechreading: Principles and Methods, National Education
Press, 1972.

The details of viseme sequence generator 120 are
shown in FIG. 2. Viseme sequence generator 120 comprises
a viseme acoustic feature extractor 210 and viseme
recognizer 220 coupled in a serial arrangement.
Viseme acoustic feature extractor 210 extracts an
acoustic feature vector sequence from the continuous
speech signal input on line 110 and outputs the acoustic
vector sequence on line 215. Viseme recognizer 220
generates a sequence of visemes from the acoustic feature
vector sequence output from viseme acoustic feature
extractor 210.

FIG. 3 is a simplified block diagram showing details
of the operation of viseme acoustic feature extractor 210
shown in FIG. 2. Referring to FIG. 3, the continuous
speech signal is sampled and pre-emphasized in
operational block 310, according to:

$\tilde{S}(n) = S(n) - a\,S(n-1)$

where $S(n)$ is the sampled speech signal and $a = 0.95$
in this illustrative example. $\tilde{S}(n)$, the
pre-emphasized sampled speech signal, is blocked into
frames in operational block 320. A Hamming window is
applied in operational block 330, having a 30 msec width
and a 10 msec shift. The resulting feature vector
sequence is output on line 335 at the audio-domain rate
of 100 samples per second. Of course, those skilled in
the art will appreciate that other audio-domain sampling
rates may just as readily be utilized according to
requirements of a particular application of the
invention. A 10th order auto-correlation and linear
predictive coding ("LPC") cepstral analysis are performed
on the feature vector, respectively, in operational
blocks 340 and 350. LPC cepstral analysis is known, and
is described, for example, by C.H. Lee et al., "Improved
Acoustic Modeling for Speaker Independent Large
Vocabulary Continuous Speech Recognition," Computer
Speech and Language, 103-127, 1992. The output of the LPC
analysis on line 355 is cepstral weighted in operational
block 360 to form the first order cepstrum feature
vector. Higher order cepstral features and energy (i.e.,
Δ, ΔΔ cepstrums and Δ, ΔΔ energy) are added to the first
order cepstrum feature vector in operational block 370.
The acoustic feature vector sequence on line 375 is then
processed by viseme recognizer 220 (FIG. 2).
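
The front end of FIG. 3 (blocks 310 through 340) might be sketched as follows. This is only an illustration under stated assumptions: the waveform sampling rate of 8000 Hz is not given in the text and is assumed here, the function names are hypothetical, and the LPC cepstral analysis, cepstral weighting, and delta features of blocks 350 through 370 are omitted.

```python
import math

def pre_emphasize(samples, a=0.95):
    """Return s~(n) = s(n) - a*s(n-1); the first sample is passed through."""
    if not samples:
        return []
    return [samples[0]] + [samples[n] - a * samples[n - 1]
                           for n in range(1, len(samples))]

def hamming(length):
    """Hamming window coefficients for a frame of the given length."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
            for n in range(length)]

def windowed_frames(samples, rate_hz=8000, width_ms=30, shift_ms=10):
    """Block the signal into overlapping frames and window each one.

    rate_hz is an assumed waveform sampling rate (not from the text).
    """
    width = rate_hz * width_ms // 1000   # samples per 30 msec frame
    shift = rate_hz * shift_ms // 1000   # samples per 10 msec shift
    window = hamming(width)
    return [[x * w for x, w in zip(samples[start:start + width], window)]
            for start in range(0, len(samples) - width + 1, shift)]

def autocorrelation(frame, order=10):
    """Autocorrelation lags 0..order of one windowed frame."""
    return [sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
            for lag in range(order + 1)]

# One second of a 200 Hz tone stands in for speech.
signal = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(8000)]
frames = windowed_frames(pre_emphasize(signal))
lags = autocorrelation(frames[0])
```

With the assumed 8 kHz input, a 10 msec shift produces one frame every 80 samples, which is the source of the roughly 100 feature vectors per second stated above.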

Referring to FIG. 4, there is shown a simplified block
diagram which illustrates the operation of viseme
recognizer 220. In operational block 410, viseme
recognizer 220 decodes the acoustic feature vector
sequence using, for example, the known Viterbi decoding
and alignment scheme, according to viseme identities from
store 420. Viseme identities are described, for example,
by the known continuous density hidden Markov model
("HMM"). The feature vector sequence can be decoded
in a frame-synchronous or asynchronous manner in
operational block 410.
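
A minimal sketch of frame-synchronous Viterbi decoding is given below for orientation. It is a simplification, not the recognizer of FIG. 4: each viseme is treated as a single state, and the per-frame log-likelihoods that a continuous density HMM would compute from the acoustic feature vectors are assumed to be supplied as input. All names and the toy probabilities are hypothetical.

```python
import math

def viterbi(log_emit, log_trans, log_init):
    """Return the most likely state sequence.

    log_emit[t][s]  : log-likelihood of frame t under state s (assumed given)
    log_trans[r][s] : log-probability of moving from state r to state s
    log_init[s]     : log-probability of starting in state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    back = []
    for t in range(1, len(log_emit)):
        prev, score, pointers = score, [], []
        for s in range(n_states):
            # Best predecessor of state s at frame t.
            best = max(range(n_states),
                       key=lambda r: prev[r] + log_trans[r][s])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][s] + log_emit[t][s])
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy example: two viseme states, four frames.
lg = math.log
emit = [[lg(0.9), lg(0.1)], [lg(0.9), lg(0.1)],
        [lg(0.1), lg(0.9)], [lg(0.1), lg(0.9)]]
trans = [[lg(0.8), lg(0.2)], [lg(0.2), lg(0.8)]]
init = [lg(0.5), lg(0.5)]
path = viterbi(emit, trans, init)
```

The decoded path assigns one viseme label per feature vector, so the output sequence runs at the same rate as the acoustic feature vector sequence.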

It will be appreciated that visemes correspond to
very short acoustic events, often at the sub-phoneme
level. Therefore, in accordance with the principles of
the invention, a fine temporal resolution is utilized in
order to accurately identify visemes from the audio
signal. In this illustrative example of the invention, as
described above, the viseme acoustic feature extractor
210 outputs a feature vector sequence at the audio-domain
sampling rate of 100 samples per second. Viseme sequence
generator 120 thus generates a viseme sequence at this
rate. By comparison, as will be appreciated by those
skilled in the art, the video-domain frame rate is
typically only 15 to 30 frames per second. In order to
resolve this rate mismatch, sequence transformer 130
(FIG. 1) transforms the high rate audio-domain viseme
sequence on line 125 into a low rate video-domain viseme
sequence. Sequence transformer 130 carries out this
function according to predetermined criteria, which may
include, for example, physiological acoustic rules of
visemes in the audio-domain, visual perception of visemes
in the video-domain, and other knowledge-based criteria.
These predetermined transformation criteria could be
stored, for example, using a criteria store 140 which is
coupled to sequence transformer 130 via line 145, as
shown in FIG. 1. In addition to rate transformation,
sequence transformer 130 may also perform image smoothing
and error correction according to predetermined
knowledge-based rules. In response to the low rate
video-domain viseme sequence, sequence transformer 130
outputs a sequence of mouth parameters representing a
mouth shape on line 147. The mouth parameters are output
at the video-domain frame rate. In this illustrative
example, the video frame rate is 30 frames per second;
thus, mouth parameters are output at a rate of 30 mouth
parameter sequences per second. The mouth parameters are
stored in a look-up table of mouth parameters. One
example of a suitable look-up table is shown in Table 1.
Table 1 shows that the mouth parameters include the
coordinates of six lip feature points around the mouth
which are shown in FIG. 5. Although six lip feature
points are utilized in this illustrative example of the
invention, it is intended that the scope of the invention
include other numbers of feature points. Moreover, those
skilled in the art will appreciate that it may be
desirable to control other feature points around portions
of the face in some applications of the invention. For
example, the eyes and head can also be controlled in
order to make the final animated image more natural in
appearance.

FIG. 6 is a simplified flow chart which shows the
operation of the sequence transformer 130 shown in FIG.
1. The transformation from audio-domain to video-domain
rates is accomplished, in accordance with the invention,
in three steps: rate conversion, weighted moving average,
and knowledge-based smoothing. The process is entered on
line 610 where visemes, V_i, are input at the
audio-domain rate of 100 samples per second. At
operational block 620, frame counter, c, and indices i, j
and k are initialized to 0. In operational block 620 the
frame counter, c, is incremented by 0.3 for each viseme
processed. The video frame number, f, which is the value
of c after a truncation operation is performed, is
computed in operational block 630. At decision block 640,
a new frame is generated when the frame number f is
greater than the index k. If f is less than k then the
current incoming viseme is stored in a buffer as shown
in operational block 650. Buffered visemes are denoted by
B_j in operational block 650. It will be apparent that
the number of visemes stored in the buffer will vary
between 3 and 4. The indices i and j are incremented by
1 in operational block 660 and control is passed to
operational block 620. In operational block 670, a viseme
in the video-domain, V_f, is determined by equating it to
the incoming audio-domain viseme, V_i.

A weighted moving average is applied to the video-domain
viseme in operational block 680. FIG. 7 is a simplified
block diagram which illustrates the weighted moving
average process. The visemes B_0, B_1, ... B_j stored in
the buffer 710 are decoded in block 720 using a viseme
table 730, for example, Table 1 shown above. A weighted
sum is applied to the decoded mouth parameters from block
680 corresponding to the buffered visemes, and a new
set of mouth parameters is produced.

Returning to FIG. 6, the weighted moving averaged mouth
parameters from operational block 680 are subjected to
knowledge-based smoothing in operational block 690. This
operation is based on the physiological characteristics
of a human speaker. For example, human articulation is
limited by physical laws; thus, it is impossible for the
mouth to move from one extreme position to another
instantaneously. In a rapid talking situation, the mouth
shapes will move to an intermediate position to prepare
for the next transition before the next viseme is
processed. Accordingly, the knowledge-based smoothing
operation can be based on physiological articulation
rules in the audio-domain, and visual perception of the
mouth shapes in the video-domain. Additionally, unnatural
high frequency movements which may result from spuriously
produced visemes in the audio-domain may be filtered
out in the smoothing operation.

After the knowledge-based smoothing in operational block
690, control is passed to operational block 695 where the
index k is equated to the frame number, f. The index j is
reinitialized to zero, and the index i is incremented by
one in operational block 697. Control is passed to
operational block 620 and the above-described process is
repeated. Advantageously, the mouth parameters are
produced by sequence transformer 130 (FIG. 1) in real
time. Additionally, it should be evident that since no
"training" of the acoustic-assisted image processor is
required to produce mouth shapes corresponding to the
speech signal, the practice of the present invention may
advantageously be performed with any speaker, without
requiring any special actions by the speaker, and without
limitations on vocabulary.
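
The three steps of FIG. 6 can be sketched roughly as below. Several details are assumptions rather than part of the description: the weights of the moving average (here increasing linearly toward the newest viseme), the inclusion of the incoming viseme in the averaged window, the use of a per-frame step limit as a stand-in for knowledge-based smoothing, and the two-entry table of mouth parameters, whose values are invented for illustration. The frame number is computed with integer arithmetic, which is equivalent to adding 0.3 per viseme and truncating but avoids floating-point drift.

```python
def video_frame_windows(visemes, audio_rate=100, video_rate=30):
    """Group audio-rate visemes into one window per video frame."""
    windows, buffer, k = [], [], 0
    for count, viseme in enumerate(visemes, start=1):
        # Same as c += 0.3 followed by truncation, done in integers.
        f = count * video_rate // audio_rate
        if f > k:
            # New video frame: buffered visemes plus the incoming one.
            windows.append(buffer + [viseme])
            buffer, k = [], f
        else:
            buffer.append(viseme)
    return windows

def weighted_average(vectors, weights=None):
    """Weighted mean of equal-length parameter vectors."""
    if weights is None:
        # Assumed weighting: later visemes in the window count more.
        weights = range(1, len(vectors) + 1)
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, vectors)) / total
            for d in range(len(vectors[0]))]

def limit_step(previous, target, max_step):
    """Hypothetical smoothing rule: bound the change per video frame."""
    return [p + max(-max_step, min(max_step, t - p))
            for p, t in zip(previous, target)]

def mouth_parameter_frames(visemes, table, max_step=2.0):
    """Rate conversion, weighted moving average, then smoothing."""
    frames = []
    for window in video_frame_windows(visemes):
        averaged = weighted_average([table[v] for v in window])
        if frames:
            averaged = limit_step(frames[-1], averaged, max_step)
        frames.append(averaged)
    return frames

# Hypothetical two-parameter table; real entries would hold the
# coordinates of the six lip feature points.
table = {"closed": [0.0, 0.0], "open": [0.0, 10.0]}
frames = mouth_parameter_frames(["closed"] * 10 + ["open"] * 10, table)
```

In this sketch each window holds either 3 or 4 visemes, and 100 input visemes yield 30 output frames, matching the 100 to 30 rate conversion described above.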

Structural deformation generator 150 (FIG. 1) generates
signals for controlling a three-dimensional ("3-D")
wire frame facial model in response to the mouth
parameters received on line 147. FIG. 8 shows an
illustrative example of a 3-D wire frame facial model
comprising a lattice of approximately 500 polygonal
elements, of which approximately 80 are used for the
mouth portion. The 3-D wire frame facial model is
manipulated to express facial motions by controlling the
lattice points of the wire frame using conventional
deformation or morphing methods. One such method is
described by K. Aizawa et al., "Model-Based Analysis
Synthesis Image Coding (MBASIC) System for a Person's
Face," Signal Processing: Image Communication 1, 139-152,
1989. It is unnecessary to control all of the lattice
points on the 3-D wire frame 200 independently because
motion of one lattice point influences neighboring
lattice points. Accordingly, in this illustrative example
of the invention, the six lattice points corresponding to
the six feature points shown in FIG. 5 are controlled by
structural deformation generator 150 using the
coordinates contained in the mouth parameters received on
line 147. The sequence of mouth parameters received on
line 147 thus describes a sequence of mouth movements on
the 3-D wire frame facial model. Structural deformation
generator 150 operates in the video-domain, which in
this illustrative example, is 30 frames per second.
Accordingly, a video sequence of 3-D wire frames, where
the video sequence describes a wire-frame image having an
animated mouth region, is output on line 155 by
structural deformation generator 150 at 30 video frames
per second. FIGs. 9 and 10 show two exemplary video
frames which illustrate this animation.
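
To illustrate how moving a few controlled lattice points can carry their neighbors along, the following sketch displaces every uncontrolled point by a distance-weighted share of each control point's motion. The Gaussian falloff, the `radius` parameter, and the function name are assumptions made for this example; it is not the deformation method of the cited reference.

```python
import math

def deform(lattice, targets, radius=1.0):
    """Move control points to their targets and drag nearby points along.

    lattice : list of (x, y, z) lattice points
    targets : dict mapping a control point's index to its new position
    radius  : assumed falloff distance for the neighbor influence
    """
    moved = []
    for i, point in enumerate(lattice):
        if i in targets:
            # Controlled feature points land exactly on their targets.
            moved.append(tuple(targets[i]))
            continue
        shift = [0.0] * len(point)
        for c, target in targets.items():
            # Influence decays with distance from the control point.
            weight = math.exp(-(math.dist(point, lattice[c]) / radius) ** 2)
            for d in range(len(point)):
                shift[d] += weight * (target[d] - lattice[c][d])
        moved.append(tuple(p + s for p, s in zip(point, shift)))
    return moved

# Three lattice points on a line; only the first is controlled.
lattice = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
moved = deform(lattice, {0: (0.0, 1.0, 0.0)})
```

The nearby point follows the control point partway while the distant point is essentially unaffected, which is why only the six feature points need to be driven directly.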

Texture mapper 160 receives the animated 3-D wire
frame image video sequence on line 155. Texture mapper
160 projects or maps a stored surface texture from
texture store 165 onto the 3-D wire frame image in
each video frame to create a final synthesized animated
video image. Texture mapping is known in the art and is
not described in detail herein. FIGs. 11 and 12 show the
3-D wire frame images shown in FIGs. 9 and 10 in which
a surface texture has been applied. The animated video
images are output at the video-domain frame rate of 30
frames per second on line 170.

FIG. 13 shows an illustrative example of a
telecommunications system 1300, incorporating an aspect
of the invention. An audio signal, for example, a speech
signal, is input on line 1310 to audio encoder 1320.
Audio encoders are known, and are typically used to
digitize and/or compress an audio signal into a digital
bitstream that utilizes less bandwidth in the
telecommunications system. The encoded audio signal is
then transmitted over transmission system 1330 to a
remote audio decoder 1340. Audio decoders are also known,
and are typically used to reconstitute the original audio
from the compressed bitstream. Audio decoder 1340 outputs
the reconstituted original audio signal on line 1350 to
some device (not shown) such as a telephone, voice-mail
system, and the like. The reconstituted audio signal is
also received by the acoustic-assisted image processor
100 shown in FIG. 1 above. Acoustic-assisted image
processor 100 outputs a video signal to some video
display device such as a monitor, videophone, and the
like. Those skilled in the art will appreciate that
portions of the acoustic-assisted image processing could
also be performed at the transmission side of the
telecommunications system 1300. For example, viseme
sequence generator 120 (FIG. 1) and viseme sequence
transformer 130 (FIG. 1) could be located on the
transmitter side and coupled to receive the original
audio signal. Mouth parameters would then be transmitted
over transmission system 1330 to the structural
deformation generator 150 (FIG. 1) and texture mapper 160
(FIG. 1) which would be located on the receiver side of
telecommunications system 1300. The mouth parameters
could be sent via a separate circuit to the receiving
side, or be multiplexed with the encoded audio signal.
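
One way the mouth parameters might be multiplexed with the encoded audio is sketched below. The packet layout is purely hypothetical: a 6-byte header, six (x, y) feature points stored as 16-bit integers, and the encoded audio for the same video frame period. None of these field sizes come from the description.

```python
import struct

HEADER = struct.Struct("<IH")   # assumed: frame number, audio byte count
MOUTH = struct.Struct("<12h")   # assumed: six (x, y) points as int16

def mux(frame_number, mouth_points, audio_bytes):
    """Pack one video frame's mouth parameters with its audio payload."""
    flat = [coord for point in mouth_points for coord in point]
    return (HEADER.pack(frame_number, len(audio_bytes))
            + MOUTH.pack(*flat)
            + audio_bytes)

def demux(packet):
    """Recover the frame number, feature points, and audio payload."""
    frame_number, audio_len = HEADER.unpack_from(packet, 0)
    flat = MOUTH.unpack_from(packet, HEADER.size)
    points = list(zip(flat[0::2], flat[1::2]))
    start = HEADER.size + MOUTH.size
    return frame_number, points, packet[start:start + audio_len]

points = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10), (11, 12)]
packet = mux(7, points, b"abc")
decoded = demux(packet)
```

Under these assumed field sizes the mouth parameters occupy 24 bytes per frame, or 720 bytes per second at 30 frames per second, which is small compared with transmitting video frames.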

FIG. 14 shows an illustrative example of a
telecommunications system 1400, incorporating an aspect
of the invention. This example is similar to the system
shown in FIG. 13; however, a video encoder 1410 is
included on the transmitter side of the
telecommunications system 1400. Video encoder 1410
receives a video signal on line 1405. The video signal on
line 1405 could be, for example, the facial image of a
speaker. Video encoder 1410 encodes the video signal and
the encoded video signal is transmitted via transmission
system 1440 to a video decoder 1420. Alternatively, the
encoded video signal may be transmitted on transmission
system 1330, using a different circuit, or multiplexed
with the encoded audio signal on the same circuit. Video
encoders and decoders are known. Video decoder 1420
reconstitutes the original video signal and outputs it to
acoustic-assisted image processor 100. Using known
techniques such as feature recognition and tracking,
acoustic-assisted image processor 100 can register the
3-D wire frame facial model to the original facial image.
The original facial image is also used as the surface
texture for the 3-D wire frame facial model rather than
the texture stored in texture store 165 (FIG. 1). The
animated video signal is output on bus 1360, as above, to
a device such as a video monitor. Acoustic-assisted image
processor 100 thus animates an image which appears to be
the speaker. Advantageously, such an animation scheme can
offer significant transmission bandwidth savings over
conventional videophone because, at a minimum, only one
frame of video needs to be transmitted to
acoustic-assisted image processor 100. This single frame,
or "snapshot," can be sent, for example, at the beginning
of the audio signal transmission on a separate circuit,
or multiplexed with the audio signal. Optionally,
additional video frames can be transmitted from video
encoder 1410 periodically to refresh the animated image
or assist in error correction. Even with the periodic
refresh frames, bandwidth savings are significant. This
illustrative example of the invention may be desirable as
a means to provide visual cues to augment the
understanding of the audio signal by hearing-impaired
people, for example. Of course, video information may be
useful in other contexts as well since it allows for more
personalized communication. Speaker identification is
also enhanced by the addition of video information, which
may be advantageous in such applications as credit card
authorization, home-shopping, airline and car
reservations, and the like.

It will be understood that the particular techniques 
described above are only illustrative of the principles of 
the present invention, and that various modifications 
could be made by those skilled in the art without depart- 
ing from the scope of the present invention, which is lim- 
ited only by the claims that follow. 



Claims 

1. A method, comprising the steps of: 

sampling an audio signal at an audio-domain 
sampling rate; 

generating a first viseme sequence in response 
to said sampled audio signal at a first rate cor- 
responding to said audio-domain sampling 
rate; 

transforming said first viseme sequence into a 
second viseme sequence at a second rate 
according to a predetermined set of transfor- 
mation criteria, said second rate corresponding 
to a video-domain frame rate; and 
processing an image in response to said sec- 
ond viseme sequence. 

2. The method as claimed in claim 1 wherein said 
audio-domain sampling rate is 100 samples per 
second. 

3. The method as claimed in claim 1 wherein said
video-domain sampling rate is selected from the
group consisting of 30 frames per second or 15
frames per second.

4. The method as claimed in claim 1 wherein said
transformation criteria include knowledge-based 
rules. 

5. The method as claimed in claim 4 wherein said 
knowledge-based rules include physiological artic- 
ulation rules. 

6. The method as claimed in claim 1 wherein said 
transformation criteria include a visual perception 
of said processed image. 

7. The method as claimed in claim 1 wherein said 
transforming includes applying a weighted moving 
average to each viseme in said first viseme 
sequence. 

8. The method as claimed in claim 1 wherein said 
image is a video image. 

9. The method as claimed in claim 1 wherein said 
image includes an image of a facial region of a 
speaker. 

10. The method as claimed in claim 9 wherein said 
processing includes animating the mouth region of
said facial image. 

11. The method as claimed in claim 9 wherein said 
processing includes animating the eye region of 
said facial image. 

12. The method as claimed in claim 1 wherein said 
processing includes animating the head region of 
said facial image. 

13. The method as claimed in claim 10 wherein said 
animating includes controllably deforming a three- 
dimensional wire-frame facial model corresponding 
to said facial image. 

14. The method as claimed in claim 13 further including
mapping a surface texture onto said three-dimen- 
sional wire-frame facial model. 

15. A method comprising the steps of: 

encoding an audio signal at the transmission
side of a transmission system;
transmitting said encoded audio signal over
said transmission system;
decoding said transmitted encoded audio signal
at the receiving side of said transmission
system;
sampling said decoded audio signal at an
audio-domain sampling rate;
generating a first viseme sequence in response
to said sampled audio signal at a first rate
corresponding to said audio-domain sampling
rate;
transforming said first viseme sequence into a
second viseme sequence at a second rate
according to a predetermined set of
transformation criteria, said second rate
corresponding to a video-domain frame rate; and
processing an image in response to said
second viseme sequence.

16. The method as claimed in claim 15 further including
encoding a video signal at the transmission side of 
said transmission system. 

17. The method as claimed in claim 16 further including
transmitting said encoded video signal over said
transmission system.

18. The method as claimed in claim 16 further including 
decoding said transmitted encoded video signal. 

19. The method as claimed in claim 18 further including
registering a three-dimensional wire-frame model 
to said decoded video signal. 

20. The method as claimed in claim 19 further including
applying a surface texture contained in said 
decoded video signal to said three-dimensional 
wire-frame model. 

21. The method as claimed in claim 15 wherein said
transforming is performed at the transmission side 
of said transmission system. 

22. An apparatus, comprising: 

means for sampling an audio signal at an
audio-domain sampling rate;

means for generating a first viseme sequence
in response to said sampled audio signal at a
first rate corresponding to said audio-domain
sampling rate;

means for transforming said first viseme
sequence into a second viseme sequence at a
second rate corresponding to a video-domain
frame rate according to a predetermined set of
transformation criteria; and
means for processing an image in response to
said second viseme sequence.

23. An apparatus comprising:

a viseme sequence generator for generating a
first viseme sequence in response to a sampled
audio signal at a first rate corresponding to an
audio-domain sampling rate;
a viseme sequence transformer coupled to said
viseme sequence generator for transforming
said first viseme sequence into a second
viseme sequence at a second rate corresponding
to a video-domain frame rate according to
a predetermined set of transformation criteria;
and

an image processor coupled to said viseme
sequence transformer for processing an image
in response to said second viseme sequence.



24. The apparatus as claimed in claim 23 wherein said
audio-domain sampling rate is 100 samples per
second.

25. The apparatus as claimed in claim 23 wherein said
video-domain sampling rate is selected from the
group consisting of 30 frames per second or 15
frames per second.

26. The apparatus as claimed in claim 23 wherein said
transformation criteria include knowledge-based
rules.

27. The apparatus as claimed in claim 26 wherein said
knowledge-based rules include physiological
articulation rules.

28. The method as claimed in claim 23 wherein said
transformation criteria include a visual perception
of said processed image.

29. The apparatus as claimed in claim 23 wherein said
viseme sequence transformer includes a means for
applying a weighted moving average to each
viseme in said first viseme sequence.

30. The apparatus as claimed in claim 23 wherein said
image is a video image.

31. The apparatus as claimed in claim 23 wherein said
image includes an image of a facial region of a
speaker.

32. The method as claimed in claim 31 wherein said
image processor includes a means for animating
the mouth region of said facial image.

33. The apparatus as claimed in claim 31 wherein said
image processor includes a means for animating
the eye region of said facial image.

34. The apparatus as claimed in claim 23 wherein said
image processor includes a means for animating
the head region of said facial image.

35. The apparatus as claimed in claim 34 wherein said
image processor includes a structural deformation
generator for controllably deforming a
three-dimensional wire-frame facial model corresponding
to said facial image.

36. The method as claimed in claim 35 wherein said
image processor includes a texture mapper for
mapping a surface texture onto said
three-dimensional wire-frame facial model.